TITLE

IMAGE INSERTION IN VIDEO STREAMS USING A COMBINATION OF 
PHYSICAL SENSORS AND PATTERN RECOGNITION 



Cross-Reference to Related Applications

The present application is related to and claims the benefit of U.S. Provisional
Application Serial No. 60/038,143 filed on November 27, 1996 entitled "IMAGE INSERTION
IN VIDEO STREAMS USING A COMBINATION OF PHYSICAL SENSORS AND PATTERN
RECOGNITION".

The present application is also related to the following co-pending commonly
owned applications: Serial No. 08/563,598 filed November 28, 1995 entitled "SYSTEM AND
METHOD FOR INSERTING STATIC AND DYNAMIC IMAGES INTO A LIVE VIDEO
BROADCAST"; Serial No. 08/580,892 filed December 29, 1995 entitled "METHOD OF
TRACKING SCENE MOTION FOR LIVE VIDEO INSERTION SYSTEMS"; Serial No.
08/662,089 filed June 12, 1996 entitled "SYSTEM AND METHOD OF REAL-TIME
INSERTIONS INTO VIDEO USING ADAPTIVE OCCLUSION WITH A SYNTHETIC COMMON
REFERENCE IMAGE"; and Serial No. 60/031,883 filed November 27, 1996 entitled
"CAMERA TRACKING USING PERSISTANT, SELECTED, IMAGE TEXTURE TEMPLATES". The
foregoing applications are all incorporated herein by reference.

Background of the Invention 

1. Field of the Invention

This invention relates to a system and method for tracking image frames for
inserting realistic indicia into video images.

2. Description of Related Art 

Electronic devices for inserting electronic images into live video signals, such as
described in U.S. Patent 5,264,933 by Rosser, et al., have been developed and used for
the purpose of inserting advertising and other indicia into broadcast events, primarily
sports events. These devices are capable of seamlessly and realistically incorporating logos
or other indicia into the original video in real time, even as the original scene is zoomed,
panned, or otherwise altered in size or perspective. Other examples include U.S. Patent
No. 5,488,675 issued to Hanna and U.S. Patent No. 5,491,517 issued to Kreitman, et al.

WO 98/24242                                                        PCT/US97/21607

Making the inserted indicia look as if it is actually in the scene is an important but
difficult aspect of implementing the technology. A troublesome aspect is that the eye of
the average viewer is very sensitive to small changes in the relative position of objects
from field to field. Experimentally, instances have been found where relative motion of an
inserted logo by as little as one tenth of one pixel of an NTSC television image is
perceptible to a viewer. Placing, and consistently maintaining to a high precision, an
inserted indicia in a broadcast environment is crucial in making video insertion technology
commercially viable. A broadcast environment includes image noise, the presence of
sudden rapid camera motion, the sporadic occurrence of moving objects which may
obscure a considerable fraction of the image, distortions in the image due to lens
characteristics and changing light levels, induced either by natural conditions or by
operator adjustment, and the vertical interlacing of television signals.

In the prior art, the automatic tracking of image motion has generally been 
performed by two different methods. 

The first method utilizes pattern recognition of the frames and examines the
image itself and either follows known landmarks in the video scene, using correlation or
difference techniques, or calculates motion using well known techniques of optical flow.
See Horn, B.K.P. and Schunck, B.G., "Determining Optical Flow", Artificial Intelligence, pp
185-203 (1981). Landmarks may be transient or permanent and may be a natural part of
the scene or introduced artificially. A change in shape and pose of the landmarks is
measured and used to insert the required indicia.

The second method, described, for instance, in U.S. Patent No. 4,084,184 issued
to D.W. Crain, uses sensors placed on the camera to provide focal distance, bearing and
elevation information. These sensors exist to provide similar landmark positional data
within a given camera's field of view.

Pattern Recognition Systems 

In the pattern recognition type of image insertion systems developed by Rosser et
al., for instance, the system has two distinct modes. First is the search mode wherein
each new frame of live video is searched in order to detect and verify a particular target
image. Second is the tracking mode, in which the system knows that in the previous
frame of video the target image was present. The system further knows the location and
orientation of that previous frame with respect to some pre-defined reference coordinate
system. The target image locations are tracked and updated with respect to the
pre-defined reference coordinate system.

The search mode encompasses pattern recognition techniques to identify certain
images. Obtaining positional data via pattern recognition, as opposed to using camera
sensors, provides significant system flexibility because it allows live video insertion systems
to make an insertion at any point in the video broadcast chain. For instance, actual
insertion can be performed at a central site which receives different video feeds from
stadiums or arenas around the country or world. The various feeds can be received via
satellite or cable or any other means known in the art. Once the insertion is added, the
video feed can be sent back via satellite or cable to the broadcast location where it
originated, or directly to viewers.

Such pattern recognition search and tracking systems, however, are difficult to
implement for some events and are the element most prone to error during live
video insertion system operation. The Assignee herein, Princeton Video Image, Inc., has
devised and programmed robust searches for many venues and events such as baseball,
football, soccer and tennis. However, the time and cost to implement similar search
algorithms can be prohibitive for other types of events. Pattern recognition searching is
difficult for events in which major changes to the look of the venue are made within hours,
or even days, of the event. This is because a pre-defined common reference image of the
venue is difficult to obtain since the look of the venue is not permanently set. In such
cases a more robust approach to the search problem is to utilize sensors attached to one
or more of the cameras to obtain target positional data.

Camera Sensor Systems 

The drawbacks of relying solely upon camera sensor systems are detailed below. 
In field trials with televised baseball and football games, previous systems encountered 
the following specific, major problems. 

1. Camera motion

In a typical sport, such as football or baseball, close up shots are taken with long
focal length cameras operating at a distance of up to several hundred yards from the
action. Both of these sports have sudden action, namely the kicking or hitting of a ball,
which results in the game changing abruptly from a tranquil scene to one of fast moving
action. As the long focal length cameras react to this activity, the image they record
displays several characteristics which render motion tracking more difficult. For example,
the motion of the image may be as fast as ten pixels per field. This will fall outside the
range of systems that examine pixel windows that are less than 10 by 10 pixels.
Additionally, the images may become defocused and suffer severe motion blurring, such
that a line which in a static image is a few pixels wide blurs out to be 10 pixels wide. This
means that a system tracking a narrow line suddenly finds no match or makes
assumptions such as the zoom has changed when in reality only fast panning has
occurred. This motion blurring also causes changes in illumination level and color, as well
as pattern texture, all of which can be problems for systems using pattern based image
processing techniques. Camera motion, even in as little as two fields, results in abrupt
image changes in the local and large scale geometry of an image. An image's illumination
level and color are affected by camera motion as well.

2. Moving objects

Sports scenes generally have a number of participants, whose general motion
follows some degree of predictability, but who may at any time suddenly do something
unexpected. This means that any automatic motion tracking of a real sports event has to
be able to cope with sudden and unexpected occlusion of various parts of the image. In
addition, the variety of uniforms and poses adopted by players in the course of a game
means that attempts to follow any purely geometric pattern in the scene have to be able to
cope with a large number of occurrences of similar patterns.

3. Lens distortion 

All practical camera lenses exhibit some degree of geometric lens distortion which
changes the relative position of objects in an image as those objects move towards the
edge of an image. When 1/10th of a pixel accuracy is required, this can cause problems.

4. Noise in the signal 

Real television signals exhibit noise, especially when the cameras are electronically
boosted to cover low light level events, such as night time baseball. This noise wreaks
havoc with image analysis techniques which rely on standard normalized correlation
recognition, as these match pattern shapes, irrespective of the strength of the signal.
Because noise shapes are random, in the course of several hundred thousand fields of
video (or a typical three hour game), the chances of mistaking noise patterns for real
patterns can be a major problem.
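
The amplitude insensitivity referred to above can be illustrated with a short Python
sketch. It assumes the common zero mean form of normalized correlation; the function
name normalized_correlation and the sample edge values are chosen for illustration only
and are not taken from any of the referenced systems.

```python
import numpy as np

def normalized_correlation(patch, template):
    """Zero-mean normalized cross-correlation of two equal-sized arrays."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    if denom == 0.0:
        return 0.0  # flat patch or template: report no correlation
    return float((p * t).sum() / denom)

# An 8 by 8 vertical edge template (hypothetical values).
template = np.tile(np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]), (8, 1))

# The same edge shape at high contrast and at 1% of that contrast.
strong_edge = 200.0 * template + 20.0
faint_edge = 2.0 * template + 20.0

score_strong = normalized_correlation(strong_edge, template)
score_faint = normalized_correlation(faint_edge, template)
```

Both scores come out at essentially 1.0, so a faint pattern of the right shape, including
one produced by noise, is scored as highly as a strong one.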

5. Field-to-field interlace

Television images, in both NTSC and PAL standards, are transmitted in two
vertically interlaced fields which together make up a frame. This means that television is
not a single stream of images, but two streams of closely related yet subtly different
images. The problem is particularly noticeable in looking at narrow horizontal lines, which
may be very evident in one field but not the other.
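
A minimal Python sketch of this effect follows. It simply separates alternate scan lines
of a small synthetic frame into two fields; the name split_fields and the 8 by 8 frame are
assumptions made for the example.

```python
import numpy as np

def split_fields(frame):
    """Split a frame (rows x columns) into its two interlaced fields."""
    # One field holds rows 0, 2, 4, ... and the other rows 1, 3, 5, ...
    return frame[0::2, :], frame[1::2, :]

frame = np.zeros((8, 8), dtype=np.uint8)
frame[3, :] = 255  # a horizontal line exactly one scan line high

field_a, field_b = split_fields(frame)
line_in_a = bool(field_a.any())
line_in_b = bool(field_b.any())
```

Here the line is present in field_b and entirely absent from field_a, which is the
situation a field based tracker has to tolerate.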

6. Illumination and color change

Outdoor games are especially prone to illumination and color changes. Typically, a
summer night baseball game will start in bright sunlight and end in floodlight darkness. An
illumination change of a factor of more than two is typical in such circumstances. In
addition, the change from natural to artificial lighting changes the color of the objects in
view. For instance, at Pro Player Park in Florida, the walls appear blue under natural
lighting but green under artificial lighting.

7. Setup differences

Cameras tend to be set up with small but detectable differences from night to 
night. For instance, camera tilt typically varies by up to plus or minus 1%, which is not 
immediately obvious to the viewer. However, this represents plus or minus 7 pixels and 
can be a problem to typical templates measuring 8 pixels by 8 pixels. 

The advantages of camera sensors include the ability to be reasonably sure of
which camera is being used and where it is pointing and at what magnification the camera
is viewing the image. Although there may be inaccuracies in the camera sensor data due
to inherent mechanical uncertainties, such as gear back-lash, these inaccuracies will never
be large. A camera sensor system will, for instance, not misrecognize an umpire as a
goal post, or "think" that a zoomed out view of a stadium is a close up view of the back
wall. It will also never confuse motion of objects in the foreground as being movement of
the camera itself.

What is needed is a system that combines the advantages of both pattern
recognition systems and camera sensor systems for searching and tracking scene motion
while eliminating or minimizing the disadvantages of each. The primary difficulty in
implementing a pattern recognition/camera sensor hybrid insertion system is the
combining and/or switching between data obtained by the two completely different
methods. If not done correctly, the combination or switch over gives unstable results
which show up as the inserted image jerking or vibrating within the overall image.
Overcoming this difficulty is crucial to making a hybrid system work well enough for
broadcast quality.

Summary of the Invention 

By way of background, an LVIS, or live video insertion system, is described in
commonly owned application Serial No. 08/563,598 filed November 28, 1995 entitled
"SYSTEM AND METHOD FOR INSERTING STATIC AND DYNAMIC IMAGES INTO A LIVE
VIDEO BROADCAST". An LVIS is a system and method for inserting static or dynamic
images into a live video broadcast in a realistic fashion on a real time basis. Initially,
natural landmarks in a scene that are suitable for subsequent detection and tracking are
selected. Landmarks preferably comprise sharp, bold, and clear vertical, horizontal,
diagonal or corner features within the scene visible to the video camera as it pans and
zooms. Typically, at least three natural landmarks are selected. It is understood
that the landmarks are distributed throughout the entire scene, such as a baseball park or
a football stadium, and that the field of view of the camera at any instant is normally
significantly smaller than the full scene that may be panned. The landmarks are often
located outside of the destination point or area where the insert will be placed because the
insert area is typically too small to include numerous identifiable landmarks and the
insertable image may be a dynamic one and, therefore, it has no single, stationary target
destination.

The system models the recognizable natural landmarks on a deformable
two-dimensional grid. An arbitrary, non-landmark, reference point is chosen within the scene.
The reference point is mathematically associated with the natural landmarks and is
subsequently used to locate the insertion area.

Prior to the insertion process, artwork of the image to be inserted is adjusted for
perspective, i.e., shape. Because the system knows the mathematical relationship
between the landmarks in the scene, it can automatically determine the zoom factor and
X, Y position adjustment that must be applied. Thereafter, when the camera zooms in
and out and changes its field of view as it pans, the insertable image remains properly
scaled and proportioned with respect to the other features in the field of view so that it
looks natural to the home viewer. The system can pan into and out of a scene and have
the insertable image naturally appear in the scene rather than "pop up" as has been the
case with some prior art systems. The system can easily place an insertable image at any
location.

The present invention is a hybrid live video insertion system (LVIS) using a
combination of the pattern recognition techniques just described, as well as others, and
camera sensor data to locate, verify and track target data. Camera sensors are well suited
to the search and detection, i.e. recognition, requirements of an LVIS while pattern
recognition and landmark tracking techniques, including those of co-pending provisional
application serial no. 60/031,883 filed November 27, 1996 entitled "CAMERA TRACKING
USING PERSISTANT, SELECTED, IMAGE TEXTURE TEMPLATES", are better suited for the
image tracking requirements of an LVIS.

The concept behind the present invention is to combine camera sensor data and
optical pattern technology so that the analysis of the video image stabilizes and refines the
camera sensor data. This stabilization and refinement can be done by substituting the
camera sensor data for the prediction schemes used by standard LVIS systems for
searching for and tracking landmark data, or by using the camera sensor data as yet
another set of landmarks, with an appropriate weighting function, in the model calculation
performed by standard LVIS systems. Once the camera sensors have acquired the
requisite data corresponding to landmarks in the scene, the data is converted to a format
that is compatible with and usable by the tracking functions of the standard LVIS and the
rest of the insertion process is carried out normally.

Thus, the present invention takes advantage of camera sensor data to provide an
LVIS with robust search capability independent of the details of the event location.
Moreover, many of the disadvantages pertaining to camera sensor systems as described
above are overcome.

The present invention comprises a typical LVIS in which one or more event
cameras include sensors for sensing the zoom and focus of the lens, and the pan and tilt
of the camera with respect to a fixed platform. For cameras in unstable locations,
additional sensors are included which measure the motion of the substantially fixed
platform with respect to a more stable stadium reference. For hand-held or mobile
cameras, a still further set of sensors is included for measuring camera location and
orientation with respect to a pre-determined set of reference positions. Sensor data from
each camera, along with tally data from the production switcher, if necessary, is used by
the LVIS to search for and detect landmark data and thereby provide a coarse indication
of where an insertion should occur in the current image. Tally data takes the form of an
electronic signal indicating which camera or video source is being output as the program
feed by the video switcher.

The sensors and tally data essentially replace the search mode of conventional
pattern recognition live video insertion systems. An accurate final determination of an
insertion location is made by using feature and/or texture analysis in the actual video
image. This analysis compares the position of the features and/or texture within the video
frame to their corresponding location in a common reference image or previous image of
the insertion location and surroundings as described in co-pending applications 08/580,892
filed December 29, 1995 entitled "METHOD OF TRACKING SCENE MOTION FOR LIVE
VIDEO INSERTION SYSTEMS" and 60/031,883 filed November 27, 1996 entitled "CAMERA
TRACKING USING PERSISTANT, SELECTED, IMAGE TEXTURE TEMPLATES".

Brief Description of the Drawings

Fig. 1 is a schematic representation showing a reference video image of a scene.

Fig. 2 is a schematic representation showing a live video image of the reference video
image in Fig. 1.

Fig. 3 is a table illustrating the elements of a typical representation of a reference array.

Fig. 4 illustrates a schematic representation of field number versus y image position in
an interlace video field.

Fig. 5a illustrates a cross-sectional view of a zero mean edge template.

Fig. 5b illustrates a plan view of a zero mean edge template.

Fig. 6 illustrates a correlation surface.

Fig. 7 illustrates a measured and predicted position on a surface.

Fig. 8 illustrates a schematic flow diagram of how a track, reference, and code hierarchy of
reference arrays is used to manage an adaptive reference array.

Fig. 9 illustrates a schematic view of landmarks and their associated sensor points used for
color based occlusion.

Fig. 10 is a schematic representation of an event broadcast using a combination of camera
sensors and image tracking system.

Fig. 11 is a block diagram describing the system of the present invention in which the
camera data is used to predict landmark location.

Fig. 12 is a block diagram describing the system of the present invention in which the
camera data is used to provide extra "virtual" landmarks appropriately weighted to
compensate for camera data errors.

Fig. 13 illustrates a camera fitted with pan, tilt, zoom and focus sensors.

Fig. 14 illustrates a representation of data output from an optically encoded sensor.

Fig. 15 illustrates the relationship between the transition of sensor track A, the state of
sensor track B and the direction of rotation, clockwise (CW) or counter-clockwise (CCW),
of the sensor.

Fig. 16 illustrates a common reference image taken from a broadcast image.

Fig. 17 illustrates a plot of Zoom (Image Magnification) against Z (the number of counts
from the counter attached to the zoom lens' zoom-element driver) with the focus-element
of the lens held stationary. Three other plots are overlaid on top of this Zoom against Z
plot. The three overlays are plots of Zoom (Image Magnification) against F (the number
of counts from the counter attached to the zoom lens' focus-element driver) at three
distinct, different and fixed settings of Z (the counts from the zoom-element driver).

Fig. 18 illustrates a camera fitted with accelerometers (sensors) for detecting camera
motion.

Fig. 19 illustrates three fixed receiving stations used to track the motion of a mobile
camera fitted with a transmitter.

Fig. 20 illustrates a broadcast situation in which the camera and an object of interest to
the event, such as a tennis ball, are both fitted with transmitters.

Detailed Description of the Preferred Embodiment 

During the course of this description like numbers will be used to identify like
elements according to the different figures that illustrate the invention.

The standard LVIS search/detection and tracking method, as described in Serial
No. 08/580,892 filed December 29, 1995 entitled "METHOD OF TRACKING SCENE MOTION
FOR LIVE VIDEO INSERTION SYSTEMS", uses template correlation with zoom insensitive
templates, such as edges, to follow a group of pre-designated landmarks or some subset
of a group within a scene. Template correlation of landmarks provides raw position
information used to follow the motion of a scene. Typically, the landmarks used may be
parts of the structure in a ball park or markings on a field of play. Creating an ideal
mathematical formulation of the scene to be tracked is a key part of the tracking
algorithm. This ideal mathematical representation is referred to as the reference array
and is simply a table of x,y coordinate values. The term "image" associated with the array
is for operator convenience. Current images or scenes are related to this reference array
by a set of warp parameters which define the mathematical transform that maps points in
the current scene to corresponding points in the reference array. In the simple case in
which rotation is ignored or kept constant the current image is mapped to the reference
array as follows:

    x' = a + bx
    y' = d + by

where x' and y' are the coordinates of a landmark in the current scene, x and y are the
coordinates of the same landmark in the reference array, b is the magnification
between the reference array and the current scene, a is the translation in the x direction
and d is the translation in the y direction between the reference array and the current
scene.
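
As an illustration only, the three parameter mapping and its inverse can be written in
Python as below. The function names and the sample parameter values are assumptions
made for this sketch.

```python
def reference_to_current(x, y, a, b, d):
    """Apply x' = a + b*x, y' = d + b*y to a reference array point."""
    return a + b * x, d + b * y

def current_to_reference(xp, yp, a, b, d):
    """Invert the mapping to recover reference array coordinates."""
    if b == 0:
        raise ValueError("magnification b must be non-zero")
    return (xp - a) / b, (yp - d) / b

# Hypothetical warp: magnification 2, shifted 10 pixels right and 5 pixels up.
xp, yp = reference_to_current(100.0, 50.0, a=10.0, b=2.0, d=-5.0)
x, y = current_to_reference(xp, yp, a=10.0, b=2.0, d=-5.0)
```

With these values the reference point (100, 50) appears at (210, 95) in the current
scene, and the inverse mapping returns it to (100, 50).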

The essence of adaptive, geographic hierarchical tracking is paying most attention
to landmarks which are found at or close to their anticipated model derived positions.

The first step is to obtain an accurate velocity prediction scheme to locate the
anticipated model derived position. Such a scheme estimates, via the warp parameters
from the previous field or scene, where the landmarks in the current image should be. The
primary difficulty with velocity prediction in interlaced video is that from field to field there
appears to be a one pixel y component to the motion. The present invention handles this
by using the position from the previous like field, and motion from the difference between
the last two unlike fields.
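
One possible reading of this rule is sketched below in Python. The fields one and three
back share the same interlace parity as each other, so their difference gives two fields'
worth of motion free of the one pixel interlace offset, and the field two back has the
same parity as the current field. The function name predict_position and the sample
coordinates are assumptions made for the example, not the actual implementation.

```python
def predict_position(history):
    """Predict the current (x, y) of a landmark from its last three fields.

    history holds model derived (x, y) positions, oldest first, so that
    history[-1] is one field back and history[-3] is three fields back.
    """
    if len(history) < 3:
        raise ValueError("need positions from at least three previous fields")
    (x3, y3), (x2, y2), (x1, y1) = history[-3], history[-2], history[-1]
    # Position from the previous like field (two back) plus the motion
    # measured between the last two unlike fields (one and three back).
    return x2 + (x1 - x3), y2 + (y1 - y3)

# Landmark panning at 2 pixels per field in x, stationary in y, with an
# apparent one pixel y offset between fields of opposite parity.
history = [(10.0, 101.0), (12.0, 100.0), (14.0, 101.0)]
predicted = predict_position(history)
```

The prediction is (16.0, 100.0). A naive estimate from the last two fields alone would
have carried the interlace offset into y and predicted 102 instead of 100.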

Having predicted where in the current image the landmarks should be, template
correlations over a 15 by 15 pixel region are then performed centered on this predicted
position. These correlation patterns are then searched from the center outward looking for
the first match that exceeds a threshold criterion. Moreover, each landmark has a
weighting function whose value is inversely proportional to the distance the landmark is
away from its anticipated model derived position. When calculating the new warp
parameters for the current scene, each landmark's current position is used weighted by
this function. This gives more emphasis to landmarks which are closer to their predicted
positions.
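
The center outward search and the distance weighting can be sketched in Python as
follows. The visiting order (increasing Euclidean distance from the center), the threshold
value and the small constant eps that keeps the weight finite at zero distance are all
assumptions of this sketch; the text above specifies only that the search proceeds
outward and that the weight falls off inversely with distance.

```python
import math
import numpy as np

def first_match_from_center(surface, threshold):
    """Return the (dy, dx) offset of the first value at or above threshold,
    visiting cells in order of increasing distance from the center, or None."""
    rows, cols = surface.shape
    if rows != cols or rows % 2 == 0:
        raise ValueError("expected a square surface with odd side length")
    c = rows // 2
    offsets = [(dy, dx) for dy in range(-c, c + 1) for dx in range(-c, c + 1)]
    offsets.sort(key=lambda o: o[0] * o[0] + o[1] * o[1])
    for dy, dx in offsets:
        if surface[c + dy, c + dx] >= threshold:
            return dy, dx
    return None

def distance_weight(dy, dx, eps=1.0):
    """Weight that decreases with distance from the predicted position."""
    return 1.0 / (eps + math.hypot(dy, dx))

# Hypothetical 15 by 15 correlation surface with a near peak and a
# stronger but more distant peak.
surface = np.zeros((15, 15))
surface[7 + 1, 7 + 2] = 0.90
surface[7 - 5, 7 + 5] = 0.95

match = first_match_from_center(surface, threshold=0.8)
weight = distance_weight(*match)
```

The nearer peak at offset (1, 2) is returned even though the distant peak is higher,
which reflects the preference for matches close to the predicted position.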

A further step, necessary to compensate for camera distortion as the scene
moves, is to dynamically update the reference array coordinates of the landmarks based
on their current locations. This updating is done only on good landmarks, and is itself
heavily weighted by the distance error weighting function. This adaptive reference array
allows very accurate tracking of landmarks even as they pass through lens and perspective
distortions. The danger in having an adaptive reference array is that it may get
contaminated. This danger is mitigated by having three sets of reference coordinates,
which are referred to as the code, game and tracking reference coordinates. When the
system is initially loaded, the code reference coordinates are set to the original reference
coordinates. The game and tracking coordinates are initially set equal to the code
reference coordinates. Once the system locates a scene and begins tracking, the tracking
coordinates are used. However, each time a scene cut occurs, the tracking coordinates are
automatically reset to the game reference coordinates. At any time the operator may
choose to set the current tracking coordinates equal to the game reference coordinates or
to set the game reference coordinates back to the code reference coordinates. This
scheme allows for adaptive reference updating with operator override capability.
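
A minimal Python sketch of the three level scheme is given below. The class name, the
method names, the landmark name and the blending rate are hypothetical; in particular
the update rule in adapt is only one simple way of weighting an update by the distance
error weight, and the observed position is assumed to have already been mapped back into
reference array coordinates.

```python
class ReferenceHierarchy:
    """Code, game and tracking copies of the landmark reference coordinates."""

    def __init__(self, original):
        self.code = dict(original)       # fixed at load time
        self.game = dict(self.code)      # initially equal to code
        self.tracking = dict(self.code)  # initially equal to code

    def adapt(self, name, observed_xy, weight, rate=0.1):
        """Nudge a good landmark's tracking coordinates toward its observed
        position, scaled by its distance error weight (assumed rule)."""
        x, y = self.tracking[name]
        ox, oy = observed_xy
        k = rate * weight
        self.tracking[name] = (x + k * (ox - x), y + k * (oy - y))

    def on_scene_cut(self):
        """Automatic reset of tracking coordinates to the game coordinates."""
        self.tracking = dict(self.game)

    def operator_reset_tracking(self):
        """Operator override: tracking coordinates set equal to game."""
        self.on_scene_cut()

    def operator_reset_game(self):
        """Operator override: game coordinates set back to code."""
        self.game = dict(self.code)

refs = ReferenceHierarchy({"wall_corner": (100.0, 40.0)})
refs.adapt("wall_corner", (104.0, 40.0), weight=0.5)
adapted = refs.tracking["wall_corner"]
refs.on_scene_cut()
restored = refs.tracking["wall_corner"]
```

In this example the tracking coordinate drifts to about (100.2, 40.0) while tracking and
returns to (100.0, 40.0) at the scene cut, so any contamination is confined to the
tracking copy.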

The final element in the tracking scheme is a method of determining when a
landmark is obscured by some object, so as to avoid spurious data in the system. A color
based occlusion method is used in which a set of sensor points in a pattern around where
a landmark is found are examined and if they are found to differ from the colors expected
in those regions, the landmark is deemed to be occluded and not used in further
calculations. The sensor points from good landmarks are used to update the reference
values for expected colors of the sensor points so that the system can accommodate
changing conditions such as the gradual shift from sunlight to artificial light during the
course of a broadcast.
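
The occlusion test and the gradual color update can be sketched in Python as below. The
color distance measure, the tolerance, the fraction of differing sensor points that
triggers occlusion and the update rate are all assumed values chosen for illustration.

```python
import numpy as np

def is_occluded(sampled, expected, tolerance=30.0, max_bad_fraction=0.5):
    """Deem a landmark occluded when too many sensor points differ from
    their expected colors (thresholds are assumptions of this sketch)."""
    diff = np.linalg.norm(sampled - expected, axis=1)
    return bool(np.mean(diff > tolerance) > max_bad_fraction)

def update_expected(expected, sampled, rate=0.05):
    """Move the expected colors slowly toward those seen at a good landmark."""
    return (1.0 - rate) * expected + rate * sampled

# Four sensor points around a landmark on a green wall (RGB, hypothetical).
expected = np.array([[30.0, 120.0, 40.0]] * 4)

clear = expected + 5.0            # slightly brighter lighting
blocked = expected.copy()
blocked[:3] = [230.0, 230.0, 230.0]  # a white uniform covers three points

clear_occluded = is_occluded(clear, expected)
blocked_occluded = is_occluded(blocked, expected)
updated = update_expected(expected, clear)
```

The slightly brighter landmark is accepted and used to shift the expected colors, while
the landmark with three covered sensor points is rejected.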

This strategy of adaptive, hierarchical tracking has proved to be a means of high
precision and robust tracking of landmarks within video sequences even in the noisy, real
world environment of live broadcast television.

Referring to figure 1, motion tracking of video images which allows seamless
insertion as practiced by this invention starts with a reference array 10 of a scene in
which insertions are to be placed. Although having an actual image is a useful mental aid,
this reference array is nothing more than a set of idealized x,y coordinate values which
represent the position of a number of key landmark sets 16 and 18 within reference array
10. A typical table is shown in figure 3, illustrating the listing of x, or horizontal
coordinates 31, and the y, or vertical coordinate positions 33. The positions 31 and 33 of
key landmark sets 16 and 18 are used both as references against which motion can be
measured and in relation to which insertions can be positioned. A typical reference array
10 of a baseball scene from a center field camera will consist of the locations of features
such as the pitcher's mound 12, the back wall 14, vertical lines 15 between the pads which
make up the back wall 14, and the horizontal line 17 between the back wall and the field
of play on which the horizontal set of landmarks 18 are set.

The current image or scene 20 is the field from a video sequence which is
presently being considered. Locations of key features or landmark sets 16 and 18 from
reference array 10 also are indicated in current image 20 as measured positions 26 and
28. Measured positions 26 and 28 are related to corresponding reference array landmark
locations from sets 16 and 18 by a set of warp parameters which define a mathematical
transform that most accurately maps the position of points in current image 20 to the
position of points in reference array 10. Such mappings are well known mathematically.
See "Geometrical Image Modification in Digital Image Processing", W.K. Pratt 2nd Edition,
1991, John Wiley and Sons, ISBN 0-471-85766.

Tracking the view from a fixed television camera, especially one with a reasonably
long focal length as in most sporting events, can be thought of as mapping one
two-dimensional surface to another two-dimensional surface. A general mathematical
transform that accomplishes such a mapping allowing for image to image translation,
zoom, shear, and rotation is given by the following six parameter model:

    x' = a + bx + cy
    y' = d + ex + fy

where

x and y are coordinates in reference array 10,
x' and y' are the transformed coordinates in current image 20,
a is the image translation in the x direction,
b is the image magnification in the x direction,
c is a combination of the rotation and skew in the x direction,
d is the image translation in the y direction,
e is a combination of the rotation and skew in the y direction, and
f is the image magnification in the y direction.
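
For illustration, the six parameter model can be applied to a set of reference array
points with a few lines of Python. The function name affine_warp and the parameter
values are assumptions made for this sketch.

```python
import numpy as np

def affine_warp(points, a, b, c, d, e, f):
    """Apply x' = a + b*x + c*y and y' = d + e*x + f*y to an (N, 2) array."""
    pts = np.asarray(points, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    return np.column_stack((a + b * x + c * y, d + e * x + f * y))

# Hypothetical warp: magnification 2 with a small rotation/skew component.
reference = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
current = affine_warp(reference, a=5.0, b=2.0, c=0.1, d=-3.0, e=-0.1, f=2.0)
```

The three reference points map to (5, -3), (25, -4) and (6, 17) respectively.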

The tracking algorithms and methods discussed herein can be used with the above
transformation as well as other more general transformations. However, experience has
shown that with a dynamically updated reference array, a simpler x,y mapping function
which assumes no shear or rotation will suffice. Thus, in the simple case in which rotation
is ignored or kept constant (c = e = 0) and the magnification in the x and y directions is
the same (b = f), the positions of points in current image 20 are mapped to the positions of
points in reference array 10 using the following equations:

    x' = a + bx
    y' = d + by

where x' and y' are coordinates of a landmark in current image 20, x and y are
coordinates of the same landmark in reference array 10, b is the magnification between
reference array 10 and current image 20, a is the translation in the x direction, and d is
the translation in the y direction. This simplified mapping scheme is used because
experience has shown it to be both robust and capable of handling the limited shear,
rotation, and perspective distortion present in television sports broadcasts when a
dynamically updated reference array is used.

Motion tracking is the method of measuring positions of landmark sets 26 and 28
in current image 20 and using these measurements to calculate the warp parameters a, d
and b, as defined by the equations above. An important part of adaptive geographic
hierarchical tracking is the concept of assigning a weight to each landmark. Weights are
assigned in inverse proportion to the distance between where each landmark is detected
and where it is expected or predicted to be found. The closer a landmark is found to
where it is predicted to be, the greater the weight given to that landmark in the
calculation of the warp parameters linking the positions in current image 20 to the
positions in reference array 10.


The first step is predicting where the landmarks 26 and 28 should be in current
image 20. This is done by analyzing the landmark positions in the three previous fields.
The previous position and velocity of a landmark derived from the previous model are used
to estimate where the landmark will appear in the current image 20. The position and
velocity calculations are complicated by the fact that both current standard methods of
television transmission, NTSC and PAL, are sent in two vertically interlaced fields. Thus,
alternate horizontal scans are included in separate fields, customarily referred to as odd
and even fields. In the NTSC system, each field is sent in 1/60th of a second (16.6 msecs),
making a combined single frame every 1/30th of a second.


One important practical consideration in the velocity estimations is that the x and
the y positions in the previous fields (-1, -2 and -3) that are used in the velocity
estimations are not the measured positions, but the positions calculated using the final
warp parameters derived in each of those fields. That is, in each field, x and y positions
are measured for each landmark. All of the landmarks are then used to derive a single set
of warp parameters a, b and d giving the mapping between the current image and the
reference array. That single set of warp parameters is then used to project the coordinates
of reference array 10 into the current image 20, giving an idealized set of landmark
positions in the current image. It is this idealized set of landmark positions in each field,
referred to as the model derived positions, that is used in the velocity predictions.

As illustrated in figure 4, the current y or vertical position of a landmark is
predicted from the previous three fields. The y position in the current field (field 0) is
predicted by measuring the y component of velocity as the difference between the
landmark's model derived position in field -1 and field -3, which are "like" fields in that
they are both either odd or even. The y velocity component is then added to the model
derived y position in field -2, which is the previous field "like" the current field, to arrive at
an estimate of where to find that landmark in the current field.

The prediction in the x direction could use the same algorithm or, since there is no
interlace, the x direction calculation can be simpler and slightly more current. In the
simpler scheme, the x component of the velocity is calculated as the difference between
the landmark's model derived position in field -1 and its model derived position in field -2.
This difference is then added to the model derived position in field -1 to arrive at an
estimate of where to find that landmark in the current field.
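
The two prediction rules can be sketched in Python as follows. This is only an illustration under stated assumptions: the model derived positions are passed in as dictionaries keyed by field offset (-1, -2, -3), and the function names and sample values are hypothetical.

```python
def predict_y(model_y):
    """Predict y in field 0 from model derived y positions of fields -1, -2, -3."""
    # Velocity is measured between the like fields -1 and -3.
    velocity = model_y[-1] - model_y[-3]
    # It is added to field -2, the previous field "like" the current one.
    return model_y[-2] + velocity


def predict_x(model_x):
    """Predict x in field 0 using the simpler, more current scheme."""
    velocity = model_x[-1] - model_x[-2]
    return model_x[-1] + velocity


y_guess = predict_y({-1: 103.0, -2: 101.5, -3: 101.0})
x_guess = predict_x({-1: 40.0, -2: 37.0})
```

Using only like fields for the vertical estimate keeps the half-line offset between odd and even fields out of the velocity term.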

Having predicted the most likely position of all the landmarks in the current image,
the positions of the landmarks are then found by doing a correlation of an 8 by 8 pixel
template over a 15 by 15 pixel region centered at the predicted position. Correlation or
template matching is a well known technique, and in its standard form is one of the most
fundamental means of object detection. See Chapter 20, "Image Detection and
Recognition", of Digital Image Processing by W.K. Pratt (2nd Edition, 1991, John Wiley and
Sons, ISBN 0-471-85766). Unlike more standard methods of correlation or template
matching in which the template is made to closely resemble the part of the scene it is
being used to find, the templates in the present invention are synthetic, idealized both in
shape and value, and are "zero-mean".


For instance, in tracking a football goal post upright, rather than use a portion of
the goal post taken from the image, the template 54 used is an edge of uniform value
made from a negative directed line 56 and a positive directed line 58, and the sum of the
values in the 8 by 8 template is equal to zero, as shown schematically in cross-section in
figure 5a and in plan view in figure 5b.
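
A zero-mean edge template of this kind can be sketched in Python as below. The placement of the negative and positive lines in adjacent columns, the unit line values, and the function names are assumptions made for illustration; the figures may show a different layout.

```python
def make_edge_template(size=8, neg_col=3, pos_col=4):
    """Build a size-by-size vertical edge template whose values sum to zero."""
    template = [[0.0] * size for _ in range(size)]
    for row in template:
        row[neg_col] = -1.0   # negative directed line
        row[pos_col] = 1.0    # positive directed line
    return template


def correlate_at(image, template, top, left):
    """Sum of products of the template with the image patch at (top, left)."""
    total = 0.0
    for r, t_row in enumerate(template):
        for c, t_val in enumerate(t_row):
            total += t_val * image[top + r][left + c]
    return total


template = make_edge_template()
flat_patch = [[128.0] * 8 for _ in range(8)]               # uniform brightness
edge_patch = [[0.0] * 4 + [200.0] * 4 for _ in range(8)]   # dark-to-bright vertical edge
flat_response = correlate_at(flat_patch, template, 0, 0)
edge_response = correlate_at(edge_patch, template, 0, 0)
```

Because the template sums to zero, a patch of uniform brightness produces no response regardless of its level, while a brightness step aligned with the two lines produces a strong one.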

This template has the advantages of being zoom independent and giving a zero
value on a surface of uniform brightness. The technique is not limited to 8 by 8 pixel
templates, nor is the region over which they are correlated limited to 15 by 15 pixel
regions. Further, this technique is not limited to zero mean templates either. In
circumstances where only vertical and horizontal lines and edges are being tracked it is
possible to reduce computation by having (1 x n) correlation surfaces for following the
horizontal detail, and (n x 1) correlation surfaces for following the vertical detail where n is
any reasonable number, usually in the range of 5-50 pixels.

The idealized, zero-mean edge template 54 is correlated over a 15 by 15 pixel
region of the current image or some amplified, filtered and decimated replica of it to
produce a correlation surface 60 as shown schematically in figure 6. This correlation
surface 60 consists of a 15 by 15 array of pixels whose brightness corresponds to the
correlation of the image against the template when centered at that position. Typically, an
edge template 54 correlated over a region of an image containing a line will give both a
positive going line response 66, indicating a good match, and a corresponding negative
going line 67, indicating a mismatch. This mismatch line 67 can be useful in that its
position and distance away from the positive going match line 66 give a measure of the
width of the line and whether it is brighter or darker than its surroundings. In addition,
there will be other bright pixels 68 on the correlation surface 60 corresponding to bright
edge like features in the current image.

A guiding principle of the adaptive-geographic-hierarchical tracking method is to
focus on landmarks and the correlation peaks indicating potential landmarks that are
closest to where they are expected to be. Rather than just looking for a peak anywhere on
the 15 by 15 correlation surface 60, these patterns are searched from the center outward.
The simplest, and very effective, way of doing this is to first look at the central nine pixel
values in the central 3 by 3 pixel region 64. If any of these pixels has a correlation value
greater than a threshold then it is assumed that the pixel represents the landmark being
sought and no further investigation of the correlation surface is done. The threshold is
usually fifty percent of the usual landmark correlation anticipated. This 3 by 3 initial
search allows motion tracking even in the presence of nearby objects that by their
brightness or shape might confuse the landmark correlation, such as when the pixel
marked 68 is brighter than the pixels in the line 66. Once the pixel with the peak
brightness is found, an estimate of the sub pixel position is found using the well known
method of reconstructing a triangle as discussed in co-pending U.S. Pat. Appl. No.
08/381,088. There are other sub pixel position estimating methods that may be used, such
as fitting higher order curves to the data.
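
The center-outward search can be sketched in Python as follows. The text only specifies the initial 3 by 3 check; continuing outward one square ring at a time when the central region holds no pixel above the threshold is an assumption of this sketch, as are the function name and the return convention. Sub pixel refinement is not shown.

```python
def find_peak_center_outward(surface, threshold):
    """Return (row, col) of the first above-threshold peak found from the center out.

    surface is a square list of lists (for example 15 by 15). Returns None when
    no pixel exceeds the threshold.
    """
    center = len(surface) // 2
    for radius in range(1, center + 1):
        best = None
        for r in range(center - radius, center + radius + 1):
            for c in range(center - radius, center + radius + 1):
                dist = max(abs(r - center), abs(c - center))
                # The first pass covers the whole central 3 by 3 region;
                # later passes look only at the newly added ring.
                if radius > 1 and dist != radius:
                    continue
                value = surface[r][c]
                if value > threshold and (best is None or value > best[0]):
                    best = (value, r, c)
        if best is not None:
            return best[1], best[2]
    return None
```

With this ordering, a moderately bright pixel near the predicted position is preferred over a brighter pixel far from it, which is the behaviour the paragraph above describes.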
In addition, each landmark found in a scene has an error weight associated with it
based on its distance from where it is expected to be. Referring now to figure 7, the
calculation of this error weight is based on the predicted position in the image 70, at the
coordinates xp, yp and the measured position in the image 72, at the coordinates xm, ym,
using the general equation:

$$\text{ErrorWeight} = \frac{g}{h + \left(i\left((x_p - x_m)^{j} + (y_p - y_m)^{k}\right)\right)^{l}}$$

where g, h, i, j, k, and l are numerical constants chosen to vary the strength of the
weighting function.

In the preferred embodiment the parameters of the equation are:

$$\text{ErrorWeight} = \frac{1.0}{1.0 + \left((x_p - x_m)^{2} + (y_p - y_m)^{2}\right)^{1.0}}$$

although in special circumstances, each of the parameters might have a different value to
change the emphasis of the weighting. For instance, numerical constants i and j may be
varied to provide a function which stays constant for a short distance and then drops
rapidly.
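
A distance-based weight of this kind can be sketched in Python as below. The sketch assumes the general form g / (h + (i((xp - xm)^j + (yp - ym)^k))^l), and the default constants (a squared-distance falloff with unit numerator and offset) are assumptions chosen for illustration, since the printed constants are not fully legible. Absolute differences are used so that non-integer exponents remain real valued, which is also an assumption.

```python
def error_weight(xp, yp, xm, ym, g=1.0, h=1.0, i=1.0, j=2.0, k=2.0, l=1.0):
    """Weight a landmark by how far it was measured from its predicted position."""
    spread = i * (abs(xp - xm) ** j + abs(yp - ym) ** k)
    return g / (h + spread ** l)


on_target = error_weight(10.0, 10.0, 10.0, 10.0)    # found exactly where predicted
off_target = error_weight(10.0, 10.0, 13.0, 14.0)   # found 5 pixels away
```

With these assumed defaults the weight is 1 for a perfect prediction and falls off with the squared distance, so distant detections contribute little to the warp parameter fit.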



This error weight is then used in the calculation of the warp parameters which
map points in the current image 20 to the positions in the reference array 10. In the
preferred embodiment this calculation is a weighted least mean squares fit using the
following matrix:



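$$
\begin{bmatrix}
\sum (C_1 \cdot C_1) & \sum (n_x \cdot C_1) & \sum (n_y \cdot C_1) \\
\sum (n_x \cdot C_1) & \sum (n_x \cdot n_x) & \sum (n_x \cdot n_y) \\
\sum (n_y \cdot C_1) & \sum (n_x \cdot n_y) & \sum (n_y \cdot n_y)
\end{bmatrix}
\begin{bmatrix} b \\ a \\ d \end{bmatrix}
=
\begin{bmatrix}
\sum (C_1 \cdot C_2) \\
\sum (n_x \cdot C_2) \\
\sum (n_y \cdot C_2)
\end{bmatrix}
$$

where

$$C_1 = n_x \cdot \text{ErrorWeight} \cdot x_p + n_y \cdot \text{ErrorWeight} \cdot y_p$$
$$C_2 = n_x \cdot \text{ErrorWeight} \cdot x_m + n_y \cdot \text{ErrorWeight} \cdot y_m$$

In the case of purely horizontal landmarks, nx=0 and ny=1, and in the case of
purely vertical landmarks nx=1 and ny=0. In the more general case, nx and ny are the
direction cosines of vectors representing the normal to the landmark's predominant
direction.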
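
To illustrate the idea of a weighted fit for the three warp parameters, the following Python sketch solves a generic weighted least squares problem for x' = a + bx and y' = d + by in closed form. It is not a transcription of the matrix above: it assumes every landmark supplies both an x and a y measurement (no direction cosines), and the function name and sample data are hypothetical.

```python
def fit_warp(reference_pts, measured_pts, weights):
    """Weighted least squares estimate of (a, b, d) for x' = a + b*x, y' = d + b*y."""
    s = sx = sy = sxm = sym = srr = srm = 0.0
    for (xr, yr), (xm, ym), w in zip(reference_pts, measured_pts, weights):
        s += w
        sx += w * xr
        sy += w * yr
        sxm += w * xm
        sym += w * ym
        srr += w * (xr * xr + yr * yr)
        srm += w * (xr * xm + yr * ym)
    if s == 0.0:
        raise ValueError("all weights are zero")
    denom = srr - (sx * sx + sy * sy) / s
    if denom == 0.0:
        raise ValueError("landmarks do not constrain the magnification")
    # Normal equations solved by eliminating a and d.
    b = (srm - (sx * sxm + sy * sym) / s) / denom
    a = (sxm - b * sx) / s
    d = (sym - b * sy) / s
    return a, b, d


ref = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
meas = [(5.0, -3.0), (25.0, -3.0), (5.0, 17.0), (25.0, 17.0)]  # a=5, b=2, d=-3
a_fit, b_fit, d_fit = fit_warp(ref, meas, [1.0, 0.5, 0.25, 1.0])
```

Landmarks with small weights still enter the sums but barely move the solution, which is how a distance-based error weight suppresses doubtful detections.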

The adaptive part of the motion tracking scheme is necessary to allow for camera
distortion. It also allows the system to compensate for small discrepancies between the
stored idealized reference array and the actual scene as well as allowing the system to
handle small slow rotation and/or shear. It further allows the system to handle any small
and slowly occurring distortions. This adaptation is done by dynamically updating the
reference array coordinates based on their current locations. In the present invention the
adaptive part of the motion tracking is made stable by the following criteria: 1) being very
careful when it is allowed to occur; 2) choosing which landmarks are allowed to participate
based on how confident the system is that said landmarks are good; and 3) having the
whole calculation heavily weighted by the distance error weighting function. In addition,
the reference array is reset after any scene cuts.

In the preferred embodiment the dynamic updating of the reference coordinates is
started after six fields of tracking and is only done on landmarks which have not been
flagged by any occlusion checks and have correlation values greater than 20% and less
than 200% of expected reference values, though different values may be used for all
these parameters.

The measured landmark positions are back projected to the positions in the
reference array using the warp parameters calculated by all the good landmarks in the
current field using the equations:

Xnr = (Xm - a)/b
Ynr = (Ym - d)/b

Xr = XOr + (ErrorWeight)^2 (Xnr - XOr)
Yr = YOr + (ErrorWeight)^2 (Ynr - YOr)

where:

Xm is the measured x coordinate of the landmark,
Ym is the measured y coordinate of the landmark,
a is the horizontal translation warp parameter,
d is the vertical translation warp parameter,
b is the magnification warp parameter,
Xnr is the calculated x coordinate of a proposed new reference point based on
this field's data,
Ynr is the calculated y coordinate of a proposed new reference point based on
this field's data,
XOr is the x coordinate of the old reference point prior to update,
YOr is the y coordinate of the old reference point prior to update,
Xr is the x coordinate put into the table as the new reference point, and
Yr is the y coordinate put into the table as the new reference point.
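
The update of a single reference point can be sketched in Python as follows. The sketch assumes that the blend factor is the square of the landmark's error weight; the function and argument names are hypothetical.

```python
def update_reference(xm, ym, x_old, y_old, a, b, d, error_weight):
    """Blend the old reference point toward the back projected measurement."""
    # Back project the measured position into reference array coordinates.
    x_new = (xm - a) / b
    y_new = (ym - d) / b
    gain = error_weight ** 2   # assumed blend factor
    return x_old + gain * (x_new - x_old), y_old + gain * (y_new - y_old)


# A landmark measured at (25, 17) with a=5, b=2, d=-3 back projects to (10, 10).
xr, yr = update_reference(25.0, 17.0, 9.0, 11.0, a=5.0, b=2.0, d=-3.0,
                          error_weight=0.5)
```

A weight near 1 moves the stored point almost fully to the new position, while a small weight leaves it nearly unchanged, which limits how fast a poor measurement can corrupt the reference array.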

It is also possible to use separate tracking reference arrays for odd and even fields
to improve the tracking performance with interlaced video. Because of the potentially
unstable nature of the adaptive reference array, the preferred embodiment has three
related reference arrays, referred to as the CODE REFERENCE, GAME REFERENCE, and
TRACKING REFERENCE.


The schematic flow diagram in figure 8 illustrates how these three references are
used. At start up, when the initial system is loaded, all three references are set to be the
same, i.e. CODE REFERENCE = GAME REFERENCE = TRACKING REFERENCE, which is to
say that the x and the y coordinates of the landmarks in each of the reference arrays are
set to be the same as the coordinates of the landmarks in the code reference array.

At run time, when the image processing is done, the three reference arrays are
used in the following manner. The game reference is used in search and verify mode, and
the tracking reference is used in tracking mode.


Initially the tracking reference array is set equal to the game reference array. In
the preferred embodiment this occurs on the first field in which the tracking is done. In
subsequent fields the tracking reference is modified as detailed above. If separate
tracking reference arrays are being used for odd and even fields they would both initially
be set to the game reference array.

At any time during the tracking mode, the operator may elect to copy the current
tracking references into the game reference using standard computer interface tools such
as a screen, keyboard, mouse, graphic user interface, trackball, touch screen or a
combination of such devices. This function is useful at the start of a game. For instance,
an operator may be setting up the live video insertion system to perform insertions at a
particular stadium. The code reference coordinates have landmark positions based on a
previous game at that stadium but the position of the landmarks may have been subtly
altered in the intervening time. The code reference, however, remains good enough for
search and tracking most of the time. Alternatively, by waiting for a shot, or having the
director set one up prior to the game, in which all the landmarks are clear of obstruction,
and allowing for the adjustment of the tracking reference to be completed, a more
accurate game reference for that particular game can be achieved.

At any time, in either the tracking or search mode, the operator can elect to reset
the game reference to the code reference. This allows recovery from operator error in
resetting the game reference to a corrupted tracking reference.

An important part of the adaptive reference process is restricting the updating to
landmarks which are known to be un-occluded by objects such as players. The method
used for this landmark occlusion detection in the preferred embodiment is color based and
takes advantage of the fact that most sports are played on surfaces which have well
defined areas of fairly uniform color, or in stadiums which have substantial features of
uniform color, such as the back wall in a baseball stadium. Each landmark 90, as shown in
figure 9, has sensor points 92 associated with it. These sensor points 92, which in the
preferred embodiment vary from 3 to 9 sensor points per landmark 90, are pixels in
predetermined locations close to, or preferably surrounding, the landmark they are
associated with. More importantly, the sensor points are all on areas of reasonably uniform
color. The decision on whether the landmarks are occluded or not is based on looking at
the sensor points and measuring their deviation from an average value. If this deviation
exceeds a pre-set value, the landmark is presumed to be occluded. Otherwise it is
available for use in other calculations, such as the model calculations and the reference
array updating.
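
The occlusion decision can be sketched in Python as below. The sketch makes several simplifying assumptions: each sensor point is reduced to a single scalar value rather than a color, the average is a stored expected value for the surface, and the deviation is measured as a mean absolute difference. The names and sample numbers are hypothetical.

```python
def is_occluded(sensor_values, expected_average, max_deviation):
    """Flag a landmark as occluded when its sensor points stray from the expected value."""
    deviation = sum(abs(v - expected_average) for v in sensor_values) / len(sensor_values)
    return deviation > max_deviation


clear = is_occluded([100.0, 102.0, 98.0, 101.0], expected_average=100.0,
                    max_deviation=10.0)
blocked = is_occluded([100.0, 30.0, 25.0, 101.0], expected_average=100.0,
                      max_deviation=10.0)
```

A landmark flagged in this way would simply be left out of the model calculation and the reference array update for that field.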

The discussion up until this point has described the LVIS search/detect and track
features of co-pending application serial no. 08/580,892 filed December 29, 1995 entitled
"METHOD OF TRACKING SCENE MOTION FOR LIVE VIDEO INSERTION SYSTEMS".

The concept of the present invention is to augment the velocity prediction scheme
of a standard LVIS with camera sensor data. While such action may sound trivial, it is in
fact a complex undertaking that requires synchronicity between different data formats.
Camera sensor data provides a "snap-shot" of a complete image field which can be
reduced to a two-dimensional image coordinate array where the entire image array is
mapped all at once, i.e. at a single instant in time. That is to say, the pixels on the left
side of the array represent the same instant in time as the pixels on the right side of the
array. Motion tracking using a standard LVIS technique, however, is a continually
updating process with respect to the image array coordinates. Thus, at any given instant,
the pixels on the left side of an image array do not represent the same instant in time as
the pixels on the right side of the image array. For the hybrid system of the present
invention to perform seamlessly, such anomalies must be accounted and compensated for.

Referring to Fig. 10, there is a camera 110 having lens 112 mounted on a tripod
mount 111, set up to record a tennis match on a tennis court 115. The camera 110 and 
lens 112 are fitted with a set of sensors 113 designed to measure the pan, tilt, zoom and 
focus of the lens 112 and camera 110. Sensors 113 also determine whether double 
magnification optics are being used. Broadcast cameras usually have a "doubler" element, 
which can be switched in or out of the lens' train of optical elements at the turn of a knob. 
Use of this doubler effectively doubles the image magnification at any given setting of the 
lens' zoom-element. This means that a single reading of Z (the counts from the zoom- 
element driver) is associated with two different values of zoom or image magnification. 
Data gatherer 114 receives data from camera sensors 113 before feeding same to a Live 
Video Insertion System (LVIS) 118 having a data interpreter 116. Data interpreter 116 
converts data forwarded by data gatherer 114 into a form that can be used by the LVIS 
system. Other similar cameras with sensors are positioned throughout the event site for 
recording different views of the action. 

Fig. 10 also shows some of the usual broadcast equipment, such as a switcher
120, used in a television production. A switcher allows the director to choose among
several video sources as the one currently being broadcast. Examples of other video
sources shown in Fig. 10 include additional cameras 110 or video storage devices 122.
Switcher 120 may also include an effects machine 124 such as a digital video effects
machine. This allows the director to transition from one video feed to another via warpers
or other image manipulation devices. Warpers are image manipulation devices that
translate an image from one perspective to another, such as, for instance, a change in
zoom, pan, or tilt.

The program feed is next sent to an LVIS 118. In addition to the
search/detection, i.e. recognition, and tracking abilities of a typical live video insertion
system, the LVIS 118 of the preferred embodiment of the present invention further
includes a data interpreter 116. Data interpreter 116 interprets camera sensor data from
data gatherer 114 and tally information received from switcher 120, thereby informing LVIS
118 which video source is currently being broadcast. LVIS 118 is further equipped with
software and hardware decision module 126. Decision module 126 allows LVIS 118 to use
sensor data in place of traditional search mode data obtained via the pattern recognition
techniques previously described. Decision module 126 can switch between a conventional
pattern recognition tracking mode and a mode where tracking is done via a combination of
camera sensor data and pattern recognition.

Once the video has passed through LVIS 118 an indicia 136 is seamlessly and
realistically inserted in the video stream. The insertion may be static, animated, or a live
video feed from a separate video source 128. The resultant video signal is then sent via a
suitable means 130, which may be satellite, aerial broadcast, or cable, to a home receiver
132 where the scene 135 with inserted indicia 136 is displayed on a conventional television
set 134.

Referring now to Fig. 13, the set of sensors that determine the pan and tilt of 
camera 110 comprise precision potentiometers or optical encoders designed to measure 
the rotation about the horizontal 146 and vertical 142 axes. Similar sensors also 
determine the focus and zoom of lens 112 by measuring the translation of optical 
elements within lens 112. Focus and zoom motion are determined by measuring the 
rotation of the shafts that move the optical elements that define focus and zoom. This is 
done by measuring the rotation about axis 150 of handle 148 used by the camera operator 
to change zoom, and about axis 154 of handle 152 used by the camera operator to effect
changes in focus. 

Data from pan sensor 140, tilt sensor 144, zoom sensor 149, and focus sensor 153
are collected by data gatherer 114. Data gatherer 114 then takes the raw voltages and/or
sensor pulses generated by the various sensors and converts them into a series of
numbers in a format that can be transmitted to data interpreter 116 of LVIS 118. Data
interpreter 116 may be located remotely or on-site. Data gatherer 114 may take the form
of a personal computer equipped with the appropriate communications and processing
cards, such as standard analog-to-digital (A/D) converter cards and serial and parallel
communications ports.

For potentiometer data, such as zoom sensor 149 and focus sensor 153, data
gatherer 114 converts an analog voltage, typically in the range -3 to +3 volts, into a digital
signal which is a series of numbers representing the position of the lens. These numbers
may be gathered at some predetermined data rate such as once per video field or once
every 6 milliseconds and forwarded to data interpreter 116 of LVIS 118. Or, LVIS 118 may
send a request to data gatherer 114 requesting an update on one or more of the
parameters being used.






Data from a typical optical encoder is in three tracks as illustrated in Fig. 14. Each
track consists of a series of binary pulses. Tracks A and B are identical but are a quarter
period out of phase with one another. A period is the combination of a low and a high
pulse. In a typical optical encoder one rotation of the sensor device through 360 degrees
will result in approximately 40,000 counts, where a count is each time the encoder output
goes from 0 to +1 or from +1 to 0. The reason for having two data tracks a quarter
period out of phase is to inform data interpreter 116 which direction the sensor is being
rotated. As illustrated in Fig. 15, if track A is making a transition then the state of track B
determines whether the sensor is being rotated clockwise or counter-clockwise. For
instance, if track A is making a transition from a high state to a low state and if track B is
in a high state then the sensor is rotating clockwise. Conversely, if track B is in a low state
the sensor is rotating counter-clockwise.


By studying the tracks A and B, data gatherer 114 can monitor sensor position 
simply by adding or subtracting counts as necessary. All that is needed is a reference 
point from which to start counting. The reference point is provided by track C. Track C
has only two states, +1 or 0. This effectively defines a 0 degree point and a 180 degree 
point. Since in a practical, fixed camera setup the arc through which the camera is rotated 
is less than 180 degrees, we need only consider the zero setting case. 

By monitoring track C transitions, data gatherer 114 is able to set the rotation
counters to zero and then increment or decrement the counters by continuously
monitoring tracks A and B. At suitable intervals, such as once per field or once every 6
milliseconds, the rotation position of the optical sensor can be forwarded to data
interpreter 116. Alternately, at any time, LVIS 118 may send a request to data gatherer
114 for a current measurement of one or more of the parameters being monitored.
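
The counting scheme can be sketched in Python as follows. The text gives the direction rule only for a falling transition of track A; treating a rising transition of A with track B low as clockwise is an assumption made by symmetry. For brevity the sketch counts only transitions of track A, and the function name and sample format are hypothetical.

```python
def count_rotation(samples):
    """Return the signed count after the last sample, or None if track C never changed.

    samples is an iterable of (a, b, c) track states, each 0 or 1.
    """
    count = None              # position is unknown until track C marks the reference
    prev_a = prev_c = None
    for a, b, c in samples:
        if prev_c is not None and c != prev_c:
            count = 0         # track C transition: zero the counter
        elif count is not None and a != prev_a:
            falling = prev_a == 1 and a == 0
            # Falling A with B high is clockwise (from the text); rising A with
            # B low is treated as clockwise too (assumed by symmetry).
            clockwise = (b == 1) if falling else (b == 0)
            count += 1 if clockwise else -1
        prev_a, prev_c = a, c
    return count


clockwise_run = [(1, 1, 0), (1, 1, 1), (0, 1, 1), (0, 0, 1), (1, 0, 1), (1, 1, 1)]
counter_run = [(1, 1, 0), (1, 1, 1), (1, 0, 1), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
```

The running count can then be reported at a fixed interval or returned on request, as described above.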

The function of data interpreter 116 is to convert the digitized position and/or
rotational information from data gatherer 114 into a format compatible with and usable by
a typical LVIS tracking system. Referring to Fig. 16, sensor data from the camera and lens
is made compatible with the LVIS tracking system by means of a common reference
image.

The common reference image is a stored image that allows for mathematical
modeling or translation between a conventional LVIS tracking system, such as that
described in commonly owned application Serial No. 08/580,892, entitled "METHOD OF
TRACKING SCENE MOTION FOR LIVE VIDEO INSERTION SYSTEMS", and a system relying
exclusively on camera sensor data. Typically, the common reference image is modeled
upon the chosen tracking method, i.e. adaptive geographical hierarchical or texture
analysis for instance, and the camera sensor data is translated to that chosen tracking
model.

There are several important aspects to the common reference image. First is the
origin. The origin is chosen as the point at which the camera lens optical axis goes through
the common reference image. This is typically not the center of the video image for two
reasons. First, there may be a slight misalignment between the axis of the zoom elements
of the lens and the optical axis of the main lens components. Second, the CCD array of
the camera may not be exactly perpendicular to the optical axis of the lens.

This offset can be handled in one of two ways. First, a zoom dependent skew
parameter can be added to the interpretation of the data. Or, second, a zero point within
the common reference image can be defined at the point where the camera lens optical
axis crosses the common reference image. The zero point can be determined in practice
in a number of ways. The preferred method first sets up a cross-hair on the image at the
center of the image. Second, zoom in on a fiducial point. A fiducial point is a fixed or
reference point. Next, pan and tilt the camera until the cross-hair is centered on the
fiducial point. Then zoom out as far as possible. Now move the cross-hair on the image
until it is centered again on the fiducial point. Lastly, repeat the second and third steps
until the cross-hair stays centered on the fiducial point as the camera is zoomed in and
out. The x, y coordinates of the fiducial point and of the cross-hair are now the (0,0)
point of the common reference image, i.e. the origin.

The common reference image shown in Fig. 16 is an image of a stadium or event
taken at some intermediate zoom with a known setting of the camera parameters pan, tilt,
zoom, and focus. The common reference image is a convenience for the operator. For
convenience, we make the following definitions: P = Pan counts (the number that pan
encoder 140 is feeding to the data interpreter); T = Tilt counts (the number that tilt
encoder 144 is feeding to the data interpreter); Z = Zoom counts (the number that zoom
encoder 149 is feeding to the data interpreter); and F = Focus counts (the number that
focus encoder 153 is feeding to the data interpreter). Camera sensor readings are also
recorded contemporaneously with the common reference image and are given the
following designations: Zo = Z at the taking of the common reference image; Fo = F at
the taking of the common reference image; To = T at the taking of the common reference
image; Po = P at the taking of the common reference image; and (Xo,Yo) are the
coordinates in the common reference image of the (0,0) point defined above.

Three calibration constants are required to translate the camera sensor data into a
form usable by a conventional LVIS image tracking system. These constants are: xp, the
number of x pixels moved per count of the pan sensor at Zo, Fo; yt, the number of y
pixels moved per count of the tilt sensor at Zo, Fo; and zf, the Z count equivalent of
one count of the F sensor at Zo. xp and yt are related by a simple constant but
have been identified separately for the sake of clarity.

Fig. 17 is a linear plot of Z, the counts from the zoom counter along the x-axis,
versus the zoom along the y-axis. The zoom at the common reference image settings is
the unit zoom. As can be seen from the dotted lines, a side effect of adjusting the camera
focus element is an alteration in the image magnification or zoom. The nature of the
alteration is very similar to the nature of the alteration in image magnification produced by
zoom adjustment. However, the change in image magnification (zoom) brought about by
adjusting the focus-element through its entire range is significantly smaller than the
change in image magnification brought about by adjusting the camera zoom element
through its entire range.

This can be understood graphically by considering two sets of plots. First, a graph 
is made of Image Magnification (Zoom) vs. the adjustment of the zoom elements of the
lens (as measured by counting the number of rotations, Z, of the screw shaft moving the 
zoom-elements in the zoom lens), with the focus-element of the zoom lens kept at a fixed 
setting. This first plot is called the Magnification vs. Zoom plot. 

Second, a number of graphs are made of Image Magnification vs. the adjustment
of the focus element of the lens (as measured by counting the number of rotations, F, of
the screw shaft moving the focus-elements in the zoom lens) at a number of distinct
settings of Z, the position of the zoom-element. These graphs are called the Magnification
vs. Focus plots.

The Magnification vs. Focus plots can then be overlaid on to the Magnification vs. 
Zoom plot. By compressing the focus axis of the Magnification vs. Focus plots, the shape
of the Magnification vs. Focus curve can be made to match the local curvature of the 
Magnification vs. Zoom plot, as shown in Fig. 17. 

The important point is that the degree of compression of the Focus axis necessary
to make the Focus curves match the Zoom curve is the same for each of the Magnification
vs. Focus curves, despite their being made at different, fixed values of Z. This means that
it is possible to simplify the mathematics of the interaction of zoom and focus on the
image size by treating zoom and focus adjustments in a similar fashion. In particular, in
determining image size or magnification, it is possible to interpret the data from the focus
sensor (the counter measuring the position of the focus-element) as being equivalent to
data from the zoom sensors (the counter measuring the position of the zoom-element).
All that is needed to make the Zoom and Focus data equivalent is a simple modification of
the Focus data by a single off-set value and a single multiplication factor. Equivalent zoom
counts are defined by:

Zec = zf(F - Fo)

zf is a calibration constant determined by plotting zoom against Z counts, and
then overlaying the zoom against F counts at particular zooms. By adjusting the F counts
so that the zoom from the focus fits the zoom curve, the constant zf can be found. The
same thing can be done analytically by first determining the relationship between zoom
and Z counts, and then using that relationship to fit zoom to F counts by adjusting zf.

In the preferred embodiment, zoom was fitted to Z using the following
exponential function using a least squares fit:

$$\text{zoom} = e^{aZ^{2} + bZ + c}$$

There may also be a lookup table to convert the raw zoom counts into zoom, or a 
combination of lookup table and a mathematical interpolation which may be similar to the 
expression in the equation above. 

Calibration constants xp and yt are measured by pointing the camera at one or 
more points in the common reference image, i.e. centering the cross-hair on the optical 
axis of the lens and recording the P and T values. By measuring the pixel distance in the 
common reference image between the selected points and the (0,0) point, calibration 
constants xp and yt are calculated by means of the following two equations: 

$$\mathit{xp} = \frac{X - X_0}{P - P_0}$$

$$\mathit{yt} = \frac{Y - Y_0}{T - T_0}$$
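
The calibration step can be sketched in a few lines of Python. This is only an illustration: the function name `pan_tilt_calibration` and the idea of averaging over several calibration points are assumptions, and the numbers in the example are invented.

```python
def pan_tilt_calibration(points, x0, y0, p0, t0):
    """Estimate xp and yt (pixels per sensor count) from calibration points.

    points -- list of (X, Y, P, T): pixel position of a selected point in the
              common reference image and the pan/tilt counts recorded when the
              cross-hair was centred on it
    x0, y0, p0, t0 -- reference pixel position and reference pan/tilt counts
    """
    xp_values = [(x - x0) / (p - p0) for x, _, p, _ in points]
    yt_values = [(y - y0) / (t - t0) for _, y, _, t in points]
    # Averaging several points is an assumption of this sketch; a single
    # point gives xp and yt directly from the two equations.
    xp = sum(xp_values) / len(xp_values)
    yt = sum(yt_values) / len(yt_values)
    return xp, yt

# Invented example values: two calibration points.
xp, yt = pan_tilt_calibration(
    [(120.0, -45.0, 1600, 350), (-60.0, 30.0, 700, 600)],
    x0=0.0, y0=0.0, p0=1000, t0=500,
)
```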

Constants xp, yt, zf, a, b and c are used with reference constants Z0, F0, P0, T0, 
X0 and Y0 to relate P, Z, T, and F to the affine coefficients used by conventional LVIS 
image tracking software, or to calculate the position of a point in the current image whose 
location is known with respect to a reference array of the common reference image. 

In the simplest affine representation, ignoring rotation and assuming zoom is the 
same in the x and y directions, the position of an object can be related to its position in the 
common reference image by the equations: 

$$x_i = Z\,x_r + t_x$$

$$y_i = Z\,y_r + t_y$$

where xi and yi are the x and y position of an object in the current image, xr and yr are 
the x and y position of the same object in the common reference image, Z is the zoom 
between the current image and the common reference image, and tx and ty are x and y 
translations between the current image and the common reference image. In the 
conventional LVIS tracking equations, Z, tx and ty are solved for by measuring the position 
of a set of known landmarks, using a weighted least squares fit. Having found Z, tx and 
ty, any other point in the common reference image can then be mapped into the current 
image using the equations for xi and yi. 
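
For this three-parameter model the weighted least squares fit has a closed form. The Python sketch below is one way to compute it, assuming each landmark is supplied as its reference position, its measured position in the current image and a weight. It is an illustration only, not the conventional LVIS implementation, and `fit_zoom_translation` is a hypothetical name.

```python
def fit_zoom_translation(landmarks):
    """Weighted least squares fit of xi = Z*xr + tx, yi = Z*yr + ty.

    landmarks -- iterable of (xr, yr, xi, yi, w) tuples
    Returns (Z, tx, ty).
    """
    W = Sxr = Syr = Sxi = Syi = Srr = Sri = 0.0
    for xr, yr, xi, yi, w in landmarks:
        W += w
        Sxr += w * xr
        Syr += w * yr
        Sxi += w * xi
        Syi += w * yi
        Srr += w * (xr * xr + yr * yr)
        Sri += w * (xr * xi + yr * yi)
    if W <= 0.0:
        raise ValueError("total weight must be positive")
    # Normal equations reduced to a single equation in Z.
    den = Srr - (Sxr * Sxr + Syr * Syr) / W
    if abs(den) < 1e-12:
        raise ValueError("landmarks must not all coincide")
    Z = (Sri - (Sxr * Sxi + Syr * Syi) / W) / den
    tx = (Sxi - Z * Sxr) / W
    ty = (Syi - Z * Syr) / W
    return Z, tx, ty

# Synthetic landmarks generated with Z = 2, tx = 10, ty = -5.
reference = [(0.0, 0.0, 1.0), (100.0, 0.0, 0.5), (0.0, 50.0, 2.0), (100.0, 50.0, 1.0)]
observed = [(xr, yr, 2.0 * xr + 10.0, 2.0 * yr - 5.0, w) for xr, yr, w in reference]
Z_fit, tx_fit, ty_fit = fit_zoom_translation(observed)
```

At least two distinct landmarks are needed; the weights let poorly located landmarks contribute less to the fit.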

From the equations above it can be seen that Z is simply: 

[equation not legible in the source] 

where M is the combined zoom and focus counts as defined by: 

tx and ty are found from the camera sensors using the relationships: 

$$t_x = \mathit{xp}\,(P - P_0)$$

$$t_y = \mathit{yt}\,(T - T_0)$$

In the preferred embodiment, data interpretation unit 116 is a software 
implementation, a hardware implementation, or a combination of the two, of 
the equations converting sensor data P, T, Z and F into Z, tx and ty, having been 
calibrated by defining P0, T0, Z0, F0, X0, Y0, zf, xp and yt. 

The x and y position of a point can be expressed directly in terms of P0, T0, Z0, 
F0, X0, Y0, zf, xp and yt by: 

$$x_i = x_r\,Z + \mathit{xp}\,(P - P_0)$$

$$y_i = y_r\,Z + \mathit{yt}\,(T - T_0)$$
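
A minimal Python sketch of this direct conversion follows. The magnification Z is taken as an input because it comes from the separate zoom and focus fit; the function name `map_reference_point` and the example numbers are illustrative assumptions.

```python
def map_reference_point(xr, yr, zoom, pan, tilt, xp, yt, p0, t0):
    """Map a reference-image point into the current image from sensor data.

    zoom      -- magnification Z already derived from the zoom/focus counts
    pan, tilt -- current pan and tilt counts P and T
    xp, yt    -- calibration constants (pixels per count)
    p0, t0    -- reference pan and tilt counts
    """
    xi = xr * zoom + xp * (pan - p0)
    yi = yr * zoom + yt * (tilt - t0)
    return xi, yi

# Invented example: a point at (50, 20) in the reference image.
xi, yi = map_reference_point(50.0, 20.0, 2.0, pan=1100, tilt=490,
                             xp=0.2, yt=0.3, p0=1000, t0=500)
```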

Whichever implementation is used, the implementation in hardware or software 
may be by the analytic expressions detailed above, by lookup tables which express or 
approximate the expressions, the experimental data the expressions were derived from, or 
by a combination of lookup tables, analytic expressions and experimental data. 

The LVIS can now use the translated camera sensor data in a number of ways. 
Whichever method is used, however, must compensate for lens distortion of the particular 
lens being used. 

One method for using the translated camera data is to use the Z, tx and ty affine 
conversion for search only, and then switch to conventional tracking. This means that the 
lens distortion can be compensated for conventionally by having a deformable common 
reference image as described in detail in commonly owned co-pending applications Serial 
Nos. 08/563,598 and 08/580,892 entitled "SYSTEM AND METHOD FOR INSERTING STATIC 
AND DYNAMIC IMAGES INTO A LIVE VIDEO BROADCAST" and "METHOD OF TRACKING 
SCENE MOTION FOR LIVE VIDEO INSERTION SYSTEMS" respectively. 

A second application for using translated camera data is to use it to supplement 
the tracking capability of the system by using the Z, tx and ty affine conversion to create 
one or more image-centric landmarks, which are always visible, but which have a 
weighting factor that always gives an error of about 2 pixels, and then feed these extra 
landmarks into a matrix based landmark tracking system as explained in detail in 
co-pending patent application Serial No. 08/580,892 filed December 29, 1995 entitled 
"METHOD OF TRACKING SCENE MOTION FOR LIVE VIDEO INSERTION SYSTEMS". The 
flexible common reference image would have to be extended to include flexible camera 
reference parameters. 

A third method for using the translated camera data is to supplement the tracking 
capability of the system by using the Z, tx and ty affine conversion to predict, or as part of 
the prediction, where optical tracking landmarks should be in the current image, and then 
use landmark or texture tracking to improve whatever model is being used to relate the 
current image to the reference array to the extent that recognizable structure is available. 
Texture tracking is described in co-pending provisional application Serial No. 60/031,883 
filed November 27, 1996 entitled "CAMERA TRACKING USING PERSISTANT, SELECTED, 
IMAGE TEXTURE TEMPLATES". This approach can be used for any model representation 
including full affine and perspective. Distortion compensation is more difficult, especially if 
the supplementation is going to be modular - i.e. available on, for instance, the zoom, x 
offset (or horizontal translation) and y offset (or vertical translation) separately and in any 
combination thereof. One robust way is to have a function or look up table that maps the 
distortion. 
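
One possible way to form such a prediction is sketched below in Python: the sensor-derived Z, tx and ty are applied to a landmark's reference position for the previous and the current field, and the resulting field-to-field shift is added to the position at which the landmark was last located optically. The function names and the tuple layout of the model are assumptions made for this illustration.

```python
def sensor_position(ref_xy, model):
    """Apply a sensor-derived (Z, tx, ty) model to a reference-image point."""
    zoom, tx, ty = model
    xr, yr = ref_xy
    return zoom * xr + tx, zoom * yr + ty

def predict_landmark(prev_located, ref_xy, prev_model, curr_model):
    """Predict a landmark's position in the current field.

    prev_located -- (x, y) where the landmark was found in the previous field
    prev_model, curr_model -- (Z, tx, ty) from the camera sensors for the
                              previous and current fields
    """
    px, py = sensor_position(ref_xy, prev_model)
    cx, cy = sensor_position(ref_xy, curr_model)
    # Only the change reported by the sensors is used, so a constant sensor
    # offset cancels out of the prediction.
    return prev_located[0] + (cx - px), prev_located[1] + (cy - py)

# Invented example: the camera pans so that tx grows from 10 to 14 pixels.
predicted = predict_landmark((211.5, 94.0), (100.0, 50.0),
                             (2.0, 10.0, -5.0), (2.0, 14.0, -5.0))
```

The predicted position would then seed the landmark or texture search, which refines the model wherever recognizable structure is present.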

Having determined the model relating the current image to the common reference 
image, the remainder of the LVIS, including insertion occlusion, can be used normally, as 
described in detail in co-pending patent application Serial No. 08/662,089 entitled 
"SYSTEM AND METHOD OF REAL-TIME INSERTIONS INTO VIDEO USING ADAPTIVE 
OCCLUSION WITH A SYNTHETIC COMMON REFERENCE IMAGE". 

In an alternative embodiment of the invention illustrated in part in Fig. 18, in 
addition to the pan, tilt, zoom and focus sensors 113 already described, there are two 
additional sensors 160 and 164 fitted in the transition module by which the camera 110 
and lens 112 are attached to the tripod mount 111. These additional sensors 160 and 164 
are accelerometers which measure acceleration in two orthogonal directions 162 and 166. 
The data from the accelerometers is fed to the data gathering unit 114, where it is 
integrated twice with respect to time to provide the current displacement of the camera in 
the x and y directions. Displacement data is fed to data interpreting unit 116, where it is 
either multiplied by a previously determined calibration constant and added to the tx and ty 
components of the translated affine transform, or multiplied by a related but different 
calibration constant and added directly to the pan and tilt counts respectively for use in 
the direct conversion into image coordinates. 
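
The double integration and the correction of tx can be sketched as follows in Python. The trapezoidal rule, the fixed sample interval and the names `displacement_from_acceleration`, `corrected_translation` and `pixels_per_metre` are assumptions of this illustration and are not taken from the embodiment.

```python
def displacement_from_acceleration(accel_samples, dt, v0=0.0, s0=0.0):
    """Integrate acceleration twice (trapezoidal rule) to get displacement.

    accel_samples -- accelerations sampled every dt seconds along one axis
    v0, s0        -- initial velocity and displacement
    """
    velocity, displacement = v0, s0
    prev_a = accel_samples[0]
    for a in accel_samples[1:]:
        new_velocity = velocity + 0.5 * (prev_a + a) * dt
        displacement += 0.5 * (velocity + new_velocity) * dt
        velocity, prev_a = new_velocity, a
    return displacement

def corrected_translation(t_sensor, displacement, pixels_per_metre):
    """Add the calibrated camera displacement to a tx or ty component."""
    return t_sensor + pixels_per_metre * displacement

# Constant 2 m/s^2 for one second should give 0.5 * 2 * 1^2 = 1 m.
shift = displacement_from_acceleration([2.0] * 11, 0.1)
tx_corrected = corrected_translation(14.0, shift, pixels_per_metre=3.0)
```

Because any bias in the accelerometer reading grows quadratically after two integrations, a practical system would need some means of resetting or bounding the accumulated displacement.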

In a simplified version of this alternative embodiment, only the accelerometer 160 
measuring acceleration in the vertical direction is added to the pan, tilt, zoom and focus 
sensors 113, as the most common problem with supposedly stationary cameras is that 
they are mounted on unstable platforms and the vertical shift is the major problem. 

In a modification of the simplified version of the alternative embodiment, a second 
accelerometer 163 is fitted at the front of the lens 112 so that camera compliance or 
oscillation in the vertical direction, independent of tilt about the axis 146, can also be 
measured and made use of in ascertaining the direction in which the camera 110 and lens 
112 are pointing at any given time. 

In another alternative embodiment of the invention, illustrated in Fig. 19, zoom 
and focus sensors 149 and 153 fitted to lens 112 are the same as in the preferred 
embodiment, but tilt and pan sensors 140 and 144 are changed, and there are an additional 
rotational sensor 174 and an additional Radio Frequency (RF) or Infra Red (IR) 
transmitter 170 attached. The tilt sensor 144 is a plumb bob potentiometer, measuring tilt 
from the normal to the local, gravitationally defined surface of the earth. The rotational 
sensor 174 is also a plumb bob potentiometer, or an optically encoded sensor with a gravity 
sensitive zero indicator, designed to measure the rotation of the camera around the axis 
176. The pan sensor 140 is a sensitive electronic compass measuring the horizontal 
rotation away from a local magnetic axis, which may for instance be the local magnetic 
north. The RF or IR transmitter 170 puts out suitably shaped pulses at predetermined, 
precisely timed intervals, which are picked up by two or more receivers 172 located in 
suitable positions in the stadium. By measuring the difference in the arrival time of the 
pulses at the receivers 172 the location of the camera in the stadium can be calculated to 
within a few millimeters. The data from the receivers 172 and the camera sensors 140, 
144, 149 and 153 is then fed to data interpreter 116 in the LVIS system. By combining the 
data, the system can calculate the position and orientation of the camera 110, as well as 
the focus and zoom of the lens 112. In this way a hand held or mobile camera can be 
accommodated. In the affine model representation, the earlier equations have been 
extended to include cross terms to deal with the rotation, e.g. 

$$y_i = Z\,y_r + \beta\,x_r + t_y$$

where β is a transformation constant to account for the extra rotational degree of 
freedom allowed by a hand held camera. 
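
Locating the transmitter from arrival-time differences can be illustrated with a small Python sketch. It works in two dimensions, assumes four receivers at known positions and noise-free timing, and uses a coarse-to-fine grid search rather than any particular solver; none of these choices, nor the names `tdoa_cost` and `locate_transmitter`, come from the embodiment.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s; RF and IR pulses travel at about this speed

def tdoa_cost(pos, receivers, arrival_times):
    """Sum of squared mismatches between predicted and measured range differences."""
    dists = [math.dist(pos, r) for r in receivers]
    cost = 0.0
    for i in range(1, len(receivers)):
        measured = SPEED_OF_LIGHT * (arrival_times[i] - arrival_times[0])
        cost += ((dists[i] - dists[0]) - measured) ** 2
    return cost

def locate_transmitter(receivers, arrival_times, x_range, y_range,
                       steps=20, rounds=10):
    """Coarse-to-fine grid search for the position that best explains the timings."""
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    best = (x_lo, y_lo)
    for _ in range(rounds):
        dx = (x_hi - x_lo) / steps
        dy = (y_hi - y_lo) / steps
        best_cost = float("inf")
        for i in range(steps + 1):
            for j in range(steps + 1):
                p = (x_lo + i * dx, y_lo + j * dy)
                c = tdoa_cost(p, receivers, arrival_times)
                if c < best_cost:
                    best, best_cost = p, c
        # Shrink the search window to two grid cells either side of the best point.
        x_lo, x_hi = best[0] - 2 * dx, best[0] + 2 * dx
        y_lo, y_hi = best[1] - 2 * dy, best[1] + 2 * dy
    return best

# Synthetic example: receivers at the corners of a 100 m by 60 m area.
receivers = [(0.0, 0.0), (100.0, 0.0), (0.0, 60.0), (100.0, 60.0)]
true_pos = (30.0, 20.0)
times = [math.dist(true_pos, r) / SPEED_OF_LIGHT for r in receivers]
estimate = locate_transmitter(receivers, times, (0.0, 100.0), (0.0, 60.0))
```

Only differences between arrival times enter the cost, so the absolute emission time of each pulse does not need to be known.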

In another alternative embodiment of the invention, illustrated in Fig. 20, the 
system can handle hand-held or mobile cameras and can also determine the position of 
objects of interest to the sport being played. For instance, in a tennis match being played 
on court 15, the ball 80 could have a transmitter concealed in it, which may be a simple 
Radio Frequency (RF) or Infra Red (IR) transmitter, which is emitting suitably shaped 
pulses at predetermined, precisely timed intervals that are differentiated from transmitter 
170 attached to mobile camera 110, either by timing, frequency, pulse shape or other 
suitable means. The receivers 172, located in suitable positions in the stadium, now 
measure the difference in the arrival time of the pulses emitted by both the camera 
transmitter 170 and the object transmitter 180. The system is now able to locate the 
instantaneous position of both the camera 110 and the ball with transmitter 180. The data 
from the camera 110 and the receivers are fed to the data gatherer 114 and then to the 
data interpreter 116. The data interpreter 116 can now infer the location, orientation, 
zoom and focus of the camera 110 and lens 112, which can, as described in detail 
previously, provide search information to the LVIS system and may also be used to 
advantage in the track mode of the LVIS system. Furthermore, the data interpreter 116 
can also provide information about the location of an object of interest 180 in the current 
image, which may be used, for instance, to provide viewer enhancements such as a graphic 
84 on the final output showing the trajectory 182 of the object of interest. 

It is to be understood that the apparatus and method of operation taught herein 
are illustrative of the invention. Modifications may readily be devised by those skilled in the 
art without departing from the spirit or scope of the invention. 

1. A method for tracking motion from field to field in a sequence of related 
video images that are scanned by at least one camera, the method comprising the steps 
of: 

a) establishing an array of idealized x and y coordinates representing a 
reference array having a plurality of landmarks where each landmark has 
unique x and y coordinates; 
b) mapping x and y coordinates in a current image to said x and y 
coordinates in said reference array; 
c) acquiring camera sensor data representing the position and orientation of 
the cameras; 
d) predicting the future location of said landmark coordinates, x' and y', 
using said camera sensor data, 
wherein prediction errors due to changes between two successive fields are 
minimized by adding (i) the field to field difference in landmark location calculated from 
said camera sensor data to (ii) the landmark position x, y previously located. 

2. The method of claim 1 wherein said mapping is achieved according to the 
following relationships: 

$$x' = a + b\,x + c\,y$$

$$y' = d + e\,x + f\,y$$

where: 
x is a horizontal coordinate in the reference array, 
y is a vertical coordinate in the reference array, 
x' is a horizontal coordinate in the current scene, 
y' is a vertical coordinate in the current scene, 
a is a warp parameter for horizontal translation of the object in the x direction, 
b is a warp parameter for magnification between the reference array and the 
current image in the x direction, 
c is a warp parameter for a combination of rotation and skew in the x direction, 
d is a warp parameter for vertical translation of the object in the y direction, 
e is a warp parameter for a combination of rotation and skew in the y direction, 
and 
f is a warp parameter for magnification between the reference array and the 
current image in the y direction. 

3. The method of claim 2 wherein said video images are vertically interlaced 
where images from field to field alternate between like and unlike fields. 

4. The method of claim 3 wherein said predicting the future location of said 
landmark coordinates, x' and y', for said interlaced video images is based on a detected 
change of position of said landmark from the previous like field. 

5. The method of claim 4 further comprising the steps of: 

d) searching for one of said landmarks in said current image by means of 
correlation using a template where the search is conducted over a substantial region 
spanning the predicted location of said landmark; 
e) multiplying the results of said correlation search in step (d) by a weighting 
function giving greater weight to correlations closer in distance to the predicted location of 
said landmark to yield a weighted correlation surface; 
f) searching said weighted correlation surface for its peak value. 

6. The method of claim 5 further comprising the steps of: 

g) determining new warp parameters a, b, c, d, e and f for a current image 
based on said landmark's current position in a current image weighted by said weighting 
function for that landmark, 
wherein emphasis is given to landmarks which are closer to their 
predicted position. 

7. The method of claim 6 wherein said weighting function comprises the 
following relationship: 

$$\mathrm{ErrorWeight} = \frac{g}{h + i\left((\mathit{xp} - \mathit{xm})^{j} + (\mathit{yp} - \mathit{ym})^{k}\right)^{l}}$$

where: 
g, h, i, j, k, and l are numerical constants; 
xp is the predicted x coordinate location of said landmark; 
xm is the measured x coordinate position of said landmark; 
yp is the predicted y coordinate location of said landmark; and, 
ym is the measured y coordinate position of said landmark. 

8. The method of claim 7 further including the step of: 

h) updating said landmark locations in said reference array according to the 
location of said landmarks in said current image, 
wherein said updating is performed based upon well identified landmarks 
and according to said landmark weighting function. 

9. The method of claim 8 further comprising the step of: 

i) establishing three types of reference arrays prior to broadcast including: 
i) a code reference array having landmark coordinates equal to said 
reference landmark coordinates, 
ii) a game reference array having landmark coordinates initially set equal 
to said code reference array coordinates, and, 
iii) a tracking reference array having landmark coordinates initially set 
equal to said code reference array coordinates. 

10. The method of claim 9 further comprising the steps of: 

j) changing said tracking reference array of coordinates during a broadcast; 
and, 
k) resetting the tracking reference array of coordinates to said game 
reference array of coordinates after a scene cut. 

11. The method of claim 10 wherein said video system is controlled by an 
operator and said method further comprises the step of: 

l) selectively choosing to set said current tracking reference array of 
coordinates equal to said game reference array of coordinates or to set said game 
reference array of coordinates back to said code reference array of coordinates, 
wherein said operator can update or override the game or tracking reference array 
of coordinates. 

12. The method of claim 11 further comprising the steps of: 

m) establishing a set of sensor points in a pattern around the location of each 
said landmark, said sensor points being able to detect changes in color and illumination; 
n) determining if said sensor points are different in color or illumination from 
the expected color or illumination; and, 
o) excluding said landmark from future calculations if said color or 
illumination is substantially different from what was expected, 
wherein said landmark is deemed to be occluded if said color or illumination at 
said sensor points is substantially different from the expected color or illumination. 

13. The method of claim 12 wherein said correlation template is a 15 by 15 
pixel window. 

14. The method of claim 1 wherein said mapping is achieved according to the 
following relationships: 

$$x' = a + b\,x$$

$$y' = d + b\,y$$

where: 
x is a horizontal coordinate in the reference array, 
y is a vertical coordinate in the reference array, 
x' is a horizontal coordinate in the current scene, 
y' is a vertical coordinate in the current scene, 
b is a warp parameter for magnification between the reference array and the 
current image, 
a is a warp parameter for horizontal translation of the object in the x direction, 
and, 
d is a warp parameter for vertical translation of the object in the y direction. 

15. The method of claim 4 further comprising the steps of: 

p) searching for one of said landmarks in said current image by means of 
correlation using a template where the starting point of the search is substantially 
centered at the predicted location of said landmark; 
q) performing said search beginning from said predicted location and 
proceeding outward looking for a match; and, 
r) discontinuing said search for said landmark when said match exceeds a 
threshold value. 

16. The method of claim 6 wherein said weighting function comprises the 
following relationship: 

$$\mathrm{ErrorWeight} = \frac{1.0}{1.0 + \left((\mathit{xp} - \mathit{xm})^{2} + (\mathit{yp} - \mathit{ym})^{2}\right)^{1.0}}$$

where: 
xp is the predicted x coordinate location of said landmark; 
xm is the measured x coordinate position of said landmark; 
yp is the predicted y coordinate location of said landmark; and, 
ym is the measured y coordinate position of said landmark. 

17. A method of merging a primary video stream into a secondary video stream so that 
the combined video stream appears to have a common origin from video field to 
video field even as the primary video stream is modulated by changes in camera 
orientation and settings, said apparent common origin achieved by using pattern 
recognition analysis of the primary video stream to stabilize and refine camera sensor 
data representing the orientation and settings of the primary video stream source 
camera, said method comprising the steps of: 

s) acquiring camera sensor data from at least one camera outfitted with 
sensors which measure the orientation and settings of the camera, 
t) converting the camera sensor data to a format suitable for transmission, 
u) transmitting the converted camera sensor data to a live video insertion 
system, 
v) converting the camera sensor data to affine form, 
w) predicting where landmarks in the previous field of video will be in the 
current field of video based upon said camera sensor data, 
x) performing correlations to detect landmark positions centered about 
landmark positions predicted by the camera sensor data, and 
y) creating a model relating a reference field of video to the current field of 
video using a weighted least mean square fit for all located landmarks. 

18. The method of claim 17 wherein the orientation and settings of said at least one 
camera comprise focus, zoom, pan, and tilt. 

19. The method of claim 17 wherein the format suitable for transmission is a numeric 
series obtained by converting the acquired camera sensor data from an analog base 
to a digital base. 

20. A method of merging a primary video stream into a secondary video stream so that 
the combined video stream appears to have a common origin from video field to 
video field even as the primary video stream is modulated by changes in camera 
orientation and settings, said apparent common origin achieved by using pattern 
recognition analysis of the primary video stream to stabilize and refine camera sensor 
data representing the orientation and settings of the primary video stream source 
camera, said method comprising the steps of: 

z) acquiring camera sensor data from at least one camera outfitted with 
sensors which measure the orientation and settings of the camera, 
aa) converting the camera sensor data to a format suitable for transmission, 
bb) transmitting the converted camera sensor data to a live video insertion 
system, 
cc) converting the camera sensor data to affine form, 
dd) performing correlations to detect landmark positions centered about 
landmark positions predicted by the camera sensor data, 
ee) creating virtual landmarks using said camera sensor data, said virtual 
landmarks appropriately weighted for camera sensor data error, and 
ff) creating a model relating a reference field of video to the current field of 
video using a weighted least mean square fit for all located and virtual 
landmarks. 


21. The method of claim 20 wherein the orientation and settings of said at least one 
camera comprise focus, zoom, pan, and tilt. 

22. The method of claim 20 wherein the format suitable for transmission is a numeric 
series obtained by converting the acquired camera sensor data from an analog base 
to a digital base. 